Contrasting Data Utilization Paradigms: The Annotation Spectrum
EvoClass-AI003 Lecture 10


The successful deployment of machine learning models hinges on the availability, quality, and cost of labeled data. In settings where human annotation is expensive, infeasible, or highly specialized, traditional paradigms become inefficient or fail outright. We introduce the notion of an "annotation spectrum," which distinguishes three core approaches according to how label information is used: Supervised Learning (SL), Unsupervised Learning (UL), and Semi-Supervised Learning (SSL).

1. Supervised Learning (SL): High Accuracy, High Cost

Supervised learning operates on datasets in which every input $X$ is explicitly paired with a known ground-truth label $Y$. While this approach usually delivers the highest predictive accuracy for classification or regression tasks, its dependence on dense, high-quality annotations makes it extremely resource-intensive. When labeled samples are scarce, performance degrades sharply, leaving the paradigm brittle and often economically unsustainable for large, continuously evolving datasets.
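As a minimal sketch (assuming scikit-learn and a small synthetic dataset; names such as X_labeled are purely illustrative), the defining trait of SL is that every training example carries its label $Y$:

```python
# Supervised-learning sketch: every training input X is paired with a label Y.
# Assumes scikit-learn is installed; the synthetic data is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(200, 2))                             # inputs X
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)   # ground-truth labels Y

clf = LogisticRegression().fit(X_labeled, y_labeled)  # training consumes every (X, Y) pair
print(clf.predict(X_labeled[:5]))                     # predictions for new inputs
```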

2. Unsupervised Learning (UL): Mining Latent Structure

Unsupervised learning operates solely on unlabeled data $D = \{X_1, X_2, ..., X_n\}$. Its goal is to infer the inherent structure, underlying probability distribution, density, or meaningful representations within the data manifold. Major applications include clustering, manifold learning, and representation learning. UL is highly effective for data preprocessing and feature engineering, providing valuable insights without any external human intervention.
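A comparable sketch for UL, under the same assumptions (scikit-learn, synthetic data), shows that clustering and principal-component extraction consume only $X$; no label $Y$ appears anywhere:

```python
# Unsupervised-learning sketch: only the unlabeled inputs X are available.
# Assumes scikit-learn; the synthetic data is illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(500, 10))

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_unlabeled)  # latent group structure
X_reduced = PCA(n_components=2).fit_transform(X_unlabeled)           # principal components
print(clusters[:10], X_reduced.shape)
```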

Question 1
Which learning paradigm is designed specifically to mitigate high reliance on expensive human data annotation by utilizing abundant unlabeled data?
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Question 2
If a model's primary task is dimensionality reduction (e.g., finding the principal components) or clustering, which paradigm is universally employed?
Supervised Learning
Semi-Supervised Learning
Unsupervised Learning
Transfer Learning
Challenge: Defining the SSL Objective
Conceptualizing the Combined Loss Function
Unlike SL, which optimizes solely for label fidelity, SSL requires a balanced optimization strategy. The total loss must capture prediction accuracy on the labeled set while enforcing consistency (e.g., smoothness or low-density separation) across the unlabeled set.

Given: $D_L$: Labeled Data. $D_U$: Unlabeled Data. $\mathcal{L}_{SL}$: Supervised Loss function. $\mathcal{L}_{Consistency}$: Loss enforcing prediction smoothness on $D_U$.
Step 1
Write the general form of the total optimization objective $\mathcal{L}_{SSL}$, incorporating a weighting coefficient $\lambda$ for the unlabeled consistency component.
Solution:
The conceptual form of the total SSL loss is a weighted sum of the two components: $\mathcal{L}_{SSL} = \mathcal{L}_{SL}(D_L) + \lambda \cdot \mathcal{L}_{Consistency}(D_U)$. The scalar $\lambda$ controls the trade-off between label fidelity and structure reliance.
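The objective can also be expressed as a short sketch (assuming NumPy; the probability arrays and the perturbation behind probs_perturbed are hypothetical stand-ins for a model's outputs on $D_U$ and on a slightly perturbed copy of $D_U$):

```python
# Conceptual SSL objective: L_SSL = L_SL(D_L) + lambda * L_Consistency(D_U).
# Assumes NumPy; the probability arrays stand in for a model's softmax outputs.
import numpy as np

def supervised_loss(probs_labeled, y_true):
    """Cross-entropy on the labeled set D_L (label fidelity)."""
    return -np.mean(np.log(probs_labeled[np.arange(len(y_true)), y_true] + 1e-12))

def consistency_loss(probs_unlabeled, probs_perturbed):
    """Mean squared difference between predictions on D_U and on a perturbed copy
    of D_U (enforces smoothness / low-density separation)."""
    return np.mean((probs_unlabeled - probs_perturbed) ** 2)

def ssl_loss(probs_labeled, y_true, probs_unlabeled, probs_perturbed, lam=1.0):
    """Total objective: weighted sum of label fidelity and unlabeled consistency."""
    return supervised_loss(probs_labeled, y_true) + lam * consistency_loss(
        probs_unlabeled, probs_perturbed)
```

Setting $\lambda = 0$ recovers the pure SL objective, while larger values shift the optimization toward the consistency term on the unlabeled data.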